feat: orphan chunk state witness pool #10613

jancionear · 2024-02-14T21:01:29Z

Description

This PR adds a pool for orphaned ChunkStateWitnesses.
To process a ChunkStateWitness we need the previous block, but sometimes it isn't available immediately. The node might receive a ChunkStateWitness before the block that's required to process it. In such cases the witness becomes an "orphaned chunk state witness" and it's put in OrphanChunkStateWitnessPool, where it waits for the desired block to appear. Once a new block is accepted, we fetch all orphaned witnesses that were waiting for this block from the pool and process them.

Design of `OrphanStateWitnessPool`

OrphanStateWitnessPool keeps a cache which maps shard_id and height to an orphan ChunkStateWitness with these parameters:

witness_cache: LruCache<(ShardId, BlockHeight), ChunkStateWitness>,

All ChunkStateWitnesses go through basic validation before being put in the orphan cache.

The signature is checked to make sure that this witness really comes from the right chunk producer that should produce a witness at this height and shard_id.
Client keeps only witnesses which are within 5 blocks of the current chain head to prevent spam attacks. Without this limitation a single malicious chunk producer could fill the whole cache with their fake witnesses.
There's also a limitation on witness size to limit the amount of memory consumed by the pool. During StatelessNet loadtests performed by @staffik and @Longarithm the observed ChunkStateWitness sIze was 16-32MB, so a 40MB limit should be alright. This PR only limits the size of orphaned witnesses, limiting the size of non-orphan witnesses is much more tricky, see the discussion in Limit ChunkStateWitness size to 16MB #10615.

It's impossible to fully validate an orphaned witness, but this partial validation should be enough to protect against attacks on the orphan pool.

Under normal circumstances there should be only a few orphaned witnesses per shard. If the node has fallen behind by more than a few blocks, it has to catch up and its chunk endorsements don't matter.
The default cache capacity is set to 25 witnesses. With 5 shards it provides capacity for 5 orphaned witnesses on each shard, which should be enough.
Assuming that a single witness can take up 40 MB, the pool will consume at most 1GB at full capacity.

The changes are divided into individual commits, they can be reviewed commit-by-commit.

Fixes

Fixes: #10552
Fixes: near/stakewars-iv#15

codecov · 2024-02-14T21:25:47Z

Codecov Report

Attention: Patch coverage is 83.93393% with 107 lines in your changes are missing coverage. Please review.

Project coverage is 72.42%. Comparing base (461c1a9) to head (6e9b143).

Files	Patch %	Lines
..._validation/chunk_validator/orphan_witness_pool.rs	86.95%	38 Missing and 1 partial ⚠️
...idation/chunk_validator/orphan_witness_handling.rs	72.03%	20 Missing and 13 partials ⚠️
chain/chain/src/test_utils/kv_runtime.rs	0.00%	13 Missing ⚠️
chain/client/src/test_utils/test_env.rs	82.14%	10 Missing ⚠️
...nt/src/stateless_validation/chunk_validator/mod.rs	79.31%	6 Missing ⚠️
chain/epoch-manager/src/lib.rs	92.18%	0 Missing and 5 partials ⚠️
chain/client/src/client_actions.rs	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10613      +/-   ##
==========================================
+ Coverage   72.34%   72.42%   +0.07%     
==========================================
  Files         730      732       +2     
  Lines      149292   149916     +624     
  Branches   149292   149916     +624     
==========================================
+ Hits       108012   108578     +566     
- Misses      36374    36417      +43     
- Partials     4906     4921      +15

Flag	Coverage Δ
backward-compatibility	`0.24% <0.00%> (-0.01%)`	⬇️
db-migration	`0.24% <0.00%> (-0.01%)`	⬇️
genesis-check	`1.42% <1.11%> (-0.01%)`	⬇️
integration-tests	`36.97% <48.49%> (-0.04%)`	⬇️
linux	`71.22% <59.15%> (+<0.01%)`	⬆️
linux-nightly	`71.84% <83.93%> (+0.03%)`	⬆️
macos	`55.46% <59.15%> (+1.78%)`	⬆️
pytests	`1.64% <1.11%> (-0.01%)`	⬇️
sanity-checks	`1.43% <1.11%> (-0.01%)`	⬇️
unittests	`68.27% <60.51%> (+0.03%)`	⬆️
upgradability	`0.28% <0.00%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sometimes when a ChunkStateWitness arrives to be processed, the block required to process it isn't available yet, we have to wait for it. near#10535 implemented a hacky way to do it by retrying the processing every 500ms until the required block arrives. This PR will implement a proper solution, so let's remove the hacky workaround.

The current function to verify state witness signatures takes the epoch_id from the previous block. For orphaned state witnesses the previous block isn't available, so it's impossible to use this function. Let's add another function that takes the epoch_id as an argument instead of fetching it from a block. It will allow us to verify signatures of orphaned chunk state witnesses.

Add a struct which keeps a cache of orphaned ChunkStateWitnesses. It provides two methods that allow to easily add a ChunkStateWitness to the pool, and then when a block arrrives, another method allows to easily fetch all orphaned ChunkStateWitnesses that were waiting for this block.

Let's split this function into two parts - one part which tries to find the previous block for the given ChunkStateWitness and another which processes the witness when the previous block is available. It will make handling orphaned witnesses easier - once the block arrives, we can just call the second function to process it. In the future it might also make it easier to process witnesses before the previous block is saved to the database, which could reduce latency.

Add a function which tries to determine in which epoch the specified height is located. It looks at the previous, current and next epoch around the chain Tip and tries to figure out to which one of them this height belongs. It's not always possible to determine in which epoch a given height will be, as the start height of the next epoch isn't known in the current epoch, so the function returns a set of possible epochs where the height might be. For example if the tip is in the middle of the current epoch, and the height is about epoch_length blocks higher than the tip, this function would return [current_epoch_id, next_epoch_id], as this height could be in either one of those epochs. It will later be used to verify the signature of orphaned witnesses. To verify the signature we need to know what is the chunk producer who should produce a witness at some height, and for that we need to know what's the epoch_id at this height.

When the previous block isn't available, we can't process the ChunkStateWitness and it becomes an orphaned state witness, waiting for the required block to arrive. In such situations the witness is partialy validated and then put in the `OrphanStateWitnessPool`. It's impossible to fully validate the witness, as the previous block isn't available, but we can do some basic checks to curb abusive behavior - check the shard_id, signature, size, etc.

When a new block is accepted we can process all orphaned state witnesses that were waiting for this block. Witnesses which became ready are taken out of the pool and processed one by one.

There is no ideal size of the witness pool, node operators might want to adjust it to achieve optimal performance. Let's make this possible by adding a configuration option which allows to control the pool size.

Add metrics which provide information about witnesses kept in the pool. Metrics are increased and decreased using a RAII struct to ensure that they're always correct, no matter what happens with the pool.

There was no way to get the index of a client with some AccountId, let's add a method that enables it to `TestEnv`

Add a function that allows to wait until some chunk endorsement is sent out. It will later be used to verify that a validator sends out endorsements for an orphaned chunk state witness when its block arrives.

`stateless_validation` is a private module of `near_client`, so it's impossible to use items defined there in integration tests. Let's expose some of them to make their use in tests possible.

`run_chunk_validation_test` has a very useful function which allows to determine a block producer at some offset. I would like to also use it in my integration tests. Let's add it to `TestEnv` so that other tests can also use it.

Add a function analogous to `get_block_producer_at_offset`, but for chunk producers.

jancionear · 2024-02-21T19:52:56Z

Attention: 93 lines in your changes are missing coverage. Please review.

The code coverage check has failed, but I reviewed the report and IMO everything important is covered by tests. The coverage report mainly complains about log messages or sanity checks that should never be triggered.

near#10613 (comment)

jancionear · 2024-02-22T19:29:20Z

@pugachAG I see that you self-requested a review.
I'll wait with merging, please go ahead if you want to do a review as well.

wacban · 2024-02-23T10:23:10Z

random thought: We may also want to clear the cache periodically. If I recall correctly currently we only remove the witness for which a block is received. This means we will never remove a witness from the cache unless we receive the block for it and eventually we'll always have full cache. I would suggest something like cleaning up witnesses with witness.height < head.height + threshold after processing a block (same place where you call process_ready_orphan_chunk_state_witnesses).

wacban

LGTM!

wacban · 2024-02-23T10:27:54Z

chain/client/src/stateless_validation/chunk_validator/orphan_witness_handling.rs

+                target: "client",
+                witness_height,
+                witness_shard,
+                witness_chunk = ?chunk_header.chunk_hash(),
+                witness_prev_block = ?chunk_header.prev_block_hash(),
+                witness_size,


I think it should, I believe the span log level only applies to when you have enabled logging of entering and closing spans.

wacban · 2024-02-23T10:32:56Z

chain/epoch-manager/src/lib.rs

+        // Finally try the previous epoch.
+        // First and last blocks of the previous epoch are already known, so the situation is clear.


I would expect that we already have the height -> epoch id mapping stored in the db in one form or another so we wouldn't need to actually walk the chain. But Longarithm's point is even better.

wacban · 2024-02-23T10:34:41Z

chain/epoch-manager/src/tests/mod.rs

+    );
+
+    let epoch1 = EpochId(h[0]);
+    dbg!(&epoch1);


Why not use tracing? It's used throughout the codebase and everyone is familiar with it. This is a mini nit given it's only a test so - optional :)

pugachAG · 2024-02-23T11:32:27Z

chain/client/src/stateless_validation/chunk_validator/mod.rs

+        prev_block: &Block,
+        processing_done_tracker: Option<ProcessingDoneTracker>,
+    ) -> Result<(), Error> {
+        if witness.inner.chunk_header.prev_block_hash() != prev_block.hash() {


this is one example where I think asserting is more appropriate than returning an error

the contract of this function already assumes block to correspond to witness.inner.chunk_header.prev_block_hash(), so violating that is a programmatic error, not an "invalid input" kind of error

having error handling here might give an impression that this could actually happened under some input, which is not true

returning error might mask an underlying issue and result in more time spent on debugging or completely hide it which is even worse

I don't know, I'm really scared of introducing code that could lead to DoS vulnerability. It might not be a big threat to normal services, but on a blockchain I imagine that a DoS means that it'd be possible to kick out validators and cause all sorts of mayhem.
IMO passing the wrong block to process_chunk_state_witness is not a fatal error - it's invalid input, for which the function can safely fail.
I added it because I know that there are plans to process witnesses before the block is applied, and I was worried that someone might accidentally use it incorrectly once the complexity of things increases.

I don't see a reason to risk a panic here. This error will show up in the logs with an ERROR log-level, and I think anyone debugging an issue looks for those, so the visibility is there. Maybe we could introduce a BUG log-level, that would be used for non-critical bugs?

feel free to merge the PR as it is and we can discuss this on protocol core weekly sync

chain/client/src/stateless_validation/chunk_validator/orphan_witness_handling.rs

pugachAG · 2024-02-23T11:58:02Z

chain/client/src/stateless_validation/chunk_validator/orphan_witness_pool.rs

+    /// List of orphaned witnesses that wait for this block to appear.
+    /// Maps block hash to entries in `witness_cache`.
+    /// Must be kept in sync with `witness_cache`.
+    waiting_for_block: HashMap<CryptoHash, HashSet<(ShardId, BlockHeight)>>,


Is it really worth maintaining this lookup map in the given context? the default size limit is 25 and I don't think it makes sense to have it much higher than that (if we so behind then there is very little chance to send the endorsement in time). I think just iterating over witness_cache would have been more than performant enough here.

Yeah, that's true.

After doing so much leetcode it feels very wrong to use a suboptimal algorithm, but there's no point in introducing so much complexity and expects when the cache holds 25 items. You're right, the rational choice is to use a naive algorithm.
Changed to a simpler implementation.

One thing I liked about the lookup map is that it provides a "safe" api - no matter what number someone passes, the cache will behave correctly. With a naive implementation using a higher value might lead to silent performance problems, so I added a warning if someone tries to create a cache with capacity of more than 128.

In previous commit I added debug spans to avoid writing witness_height, ... everywhere. I also removed them from the ejection message, but that was a mistake. The span contains information about the witness that is being added, not the one that is being ejected. Because of that the message about ejecting a witness would contain invalid witness_height and other parameters. Fix by using the correct parameters, and names which don't collide with the ones in the span.

I imported near-o11y to use the function for initializing a tracing-subscriber in epoch manager tests, but I didn't import the corresponding nightly packages, so a CI check failed. Let's fix this by importing the packages that the CI check prosposed.

jancionear · 2024-02-23T16:03:53Z

mini nit: Maybe make it configurable in case the witnesses turn out to be larger than we expect? Totally optional as long as the value has a safe margin over the expected max state witness size.

random thought: We may also want to clear the cache periodically. If I recall correctly currently we only remove the witness for which a block is received. This means we will never remove a witness from the cache unless we receive the block for it and eventually we'll always have full cache. I would suggest something like cleaning up witnesses with witness.height < head.height + threshold after processing a block (same place where you call process_ready_orphan_chunk_state_witnesses).

Opened issues about it, those features can be added in a follow-up PR, this one is big enough already.
#10649 #10650

Ran python3 scripts/fix_nightly_feature_flags.py fix to satisfy the CI

While resolving a merge conflict caused by near#10646, I accidentally removed a handler for chunk endorsement messages. I didn't mean to do it, I wanted to only modify handlers for chunk state witnesses. Let's undo the change and bring the chunk endorsement handler back.

This reverts commit 5cae3c3. It was actually necessary to remove the endorsement message handler, it doesn't compile otherwise 🤦

walnut-the-cat · 2024-02-23T19:06:20Z

Do we have logging to monitor how much cache is consumed (so we can adjust as needed)?

jancionear · 2024-02-23T19:09:30Z

Do we have logging to monitor how much cache is consumed (so we can adjust as needed)?

There are metrics for the orphan witness pool (see f9b9c52). I'll later add the corresponding dashboards in Grafana.

But without #10649 the cache might just be at full capacity all the time x.x.

shreyan-gupta · 2024-02-27T16:01:45Z

Missed reviewing this, but great PR!

jancionear added the A-stateless-validation Area: stateless validation label Feb 14, 2024

jancionear requested a review from Longarithm February 14, 2024 21:01

jancionear requested a review from a team as a code owner February 14, 2024 21:01

jancionear marked this pull request as draft February 14, 2024 21:02

jancionear force-pushed the orphan-witness branch 3 times, most recently from a2a8632 to 42caa7c Compare February 21, 2024 13:31

jancionear added 4 commits February 21, 2024 13:32

jancionear force-pushed the orphan-witness branch 3 times, most recently from 736bc2b to 3ea39f7 Compare February 21, 2024 19:20

jancionear added 12 commits February 21, 2024 19:20

Process orphaned witnesses when a new block is accepted

5a76c7c

When a new block is accepted we can process all orphaned state witnesses that were waiting for this block. Witnesses which became ready are taken out of the pool and processed one by one.

Allow to configure orphan pool size from config.js

97a0482

There is no ideal size of the witness pool, node operators might want to adjust it to achieve optimal performance. Let's make this possible by adding a configuration option which allows to control the pool size.

Add metrics to improve observability of OrphanStateWitnessPool

f9b9c52

Add metrics which provide information about witnesses kept in the pool. Metrics are increased and decreased using a RAII struct to ensure that they're always correct, no matter what happens with the pool.

Add unit tests for OrphanStateWitnessPool

42af070

Add TestEnv::get_client_index()

6592ace

There was no way to get the index of a client with some AccountId, let's add a method that enables it to `TestEnv`

Add TestEnv::wait_for_chunk_endorsement

a29b8df

Add a function that allows to wait until some chunk endorsement is sent out. It will later be used to verify that a validator sends out endorsements for an orphaned chunk state witness when its block arrives.

Expose some items that are needed for the integration tests

22a2b15

`stateless_validation` is a private module of `near_client`, so it's impossible to use items defined there in integration tests. Let's expose some of them to make their use in tests possible.

Move get_block_producer to TestEnv

9974f09

`run_chunk_validation_test` has a very useful function which allows to determine a block producer at some offset. I would like to also use it in my integration tests. Let's add it to `TestEnv` so that other tests can also use it.

Add TestEnv::get_chunk_producer_at_offset

dc19727

Add a function analogous to `get_block_producer_at_offset`, but for chunk producers.

Add integration tests

0183075

jancionear force-pushed the orphan-witness branch from 3ea39f7 to 0183075 Compare February 21, 2024 19:20

jancionear marked this pull request as ready for review February 21, 2024 19:48

Remove comment which no longer applies

0b126b8

Longarithm approved these changes Feb 22, 2024

View reviewed changes

jancionear added 2 commits February 22, 2024 19:25

Increase default orphan pool size to 25

a6858eb

near#10613 (comment)

I'm bad at math

0edf57a

wacban approved these changes Feb 23, 2024

View reviewed changes

pugachAG approved these changes Feb 23, 2024

View reviewed changes

jancionear added 4 commits February 23, 2024 14:26

Use tracing for logs in test_possible_epochs_of_height_around_tip

bb41b37

Use naive implementation of OrphanStateWitnessPool to reduce complexity

33248ac

Merge branch 'master' into orphan-witness

4e6711d

This was referenced Feb 23, 2024

Remove witnesses which are no longer needed from OrphanStateWitnessPool #10649

Closed

Make size limit of orphan witnesses configurable #10650

Closed

Fix the import of near-o11y

24b25f5

I imported near-o11y to use the function for initializing a tracing-subscriber in epoch manager tests, but I didn't import the corresponding nightly packages, so a CI check failed. Let's fix this by importing the packages that the CI check prosposed.

Fix formatting of epoch-manager/Cargo.toml

58be57b

Ran python3 scripts/fix_nightly_feature_flags.py fix to satisfy the CI

jancionear requested a review from pugachAG February 23, 2024 16:34

pugachAG approved these changes Feb 23, 2024

View reviewed changes

jancionear added 3 commits February 23, 2024 18:40

Merge branch 'master' into orphan-witness

056393d

Revert "fix: Don't remove the handler for ChunkEndorsementMessage"

6e9b143

This reverts commit 5cae3c3. It was actually necessary to remove the endorsement message handler, it doesn't compile otherwise 🤦

jancionear added this pull request to the merge queue Feb 23, 2024

Merged via the queue into near:master with commit 2158533 Feb 23, 2024
27 of 28 checks passed

jancionear deleted the orphan-witness branch February 23, 2024 21:20

jancionear mentioned this pull request Apr 5, 2024

refactor: state witness struct #10934

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: orphan chunk state witness pool #10613

feat: orphan chunk state witness pool #10613

jancionear commented Feb 14, 2024 •

edited

Loading

codecov bot commented Feb 14, 2024 •

edited

Loading

jancionear commented Feb 21, 2024

jancionear commented Feb 22, 2024 •

edited

Loading

wacban commented Feb 23, 2024

wacban left a comment

wacban Feb 23, 2024

wacban Feb 23, 2024

wacban Feb 23, 2024

pugachAG Feb 23, 2024

jancionear Feb 23, 2024

pugachAG Feb 23, 2024

pugachAG Feb 23, 2024

jancionear Feb 23, 2024 •

edited

Loading

jancionear commented Feb 23, 2024

walnut-the-cat commented Feb 23, 2024

jancionear commented Feb 23, 2024 •

edited

Loading

shreyan-gupta commented Feb 27, 2024

		// Finally try the previous epoch.
		// First and last blocks of the previous epoch are already known, so the situation is clear.

feat: orphan chunk state witness pool #10613

feat: orphan chunk state witness pool #10613

Conversation

jancionear commented Feb 14, 2024 • edited Loading

Description

Design of OrphanStateWitnessPool

Fixes

codecov bot commented Feb 14, 2024 • edited Loading

Codecov Report

jancionear commented Feb 21, 2024

jancionear commented Feb 22, 2024 • edited Loading

wacban commented Feb 23, 2024

wacban left a comment

Choose a reason for hiding this comment

wacban Feb 23, 2024

Choose a reason for hiding this comment

wacban Feb 23, 2024

Choose a reason for hiding this comment

wacban Feb 23, 2024

Choose a reason for hiding this comment

pugachAG Feb 23, 2024

Choose a reason for hiding this comment

jancionear Feb 23, 2024

Choose a reason for hiding this comment

pugachAG Feb 23, 2024

Choose a reason for hiding this comment

pugachAG Feb 23, 2024

Choose a reason for hiding this comment

jancionear Feb 23, 2024 • edited Loading

Choose a reason for hiding this comment

jancionear commented Feb 23, 2024

walnut-the-cat commented Feb 23, 2024

jancionear commented Feb 23, 2024 • edited Loading

shreyan-gupta commented Feb 27, 2024

jancionear commented Feb 14, 2024 •

edited

Loading

Design of `OrphanStateWitnessPool`

codecov bot commented Feb 14, 2024 •

edited

Loading

jancionear commented Feb 22, 2024 •

edited

Loading

jancionear Feb 23, 2024 •

edited

Loading

jancionear commented Feb 23, 2024 •

edited

Loading